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Abstract —In this paper, we generalize Huber’s criterion to 
multichannel sparse recovery problem of complex-valued mea¬ 
surements where the objective is to find good recovery of 
jointly sparse unknown signal vectors from the given multiple 
measurement vectors which are different linear combinations of 
the same known elementary vectors. This requires careful char¬ 
acterization of robust complex-valued loss functions as well as 
Huber’s criterion function for the multivariate sparse regression 
problem. We devise a greedy algorithm based on simultaneous 
normalized iterative hard thresholding (SNIHT) algorithm. Un¬ 
like the conventional SNIHT method, our algorithm, referred 
to as HUB-SNIHT, is robust under heavy-tailed non-Gaussian 
noise conditions, yet has a negligible performance loss compared 
to SNIHT under Gaussian noise. Usefulness of the method is 
illustrated in source localization application with sensor arrays. 

I. Introduction 

In the multiple measurement vector (MMV) model, a single 
measurement matrix is utilized to obtain multiple measure¬ 
ment vectors, i.e., -I- e^, i = 1,... ,Q where 

= (01 = (</’(!) ••• is an M X N 

measurement matrix and are the (unobserved) random noise 
vectors. Typically there are more column vectors 0^ than row 
vectors 4>[j)7 i-C-, M < N. The unknown signal vectors x^, 
i = 1,... ,Q are assumed to be sparse, i.e., most of the 
elements are zero. In matrix form, the MMV model is 

Y = $X-fE, (1) 

where Y = (yi ■•■yg) S X = (xi xg) e 

£NxQ E = (ei • • • eg) e collect the mea¬ 

surement, the signal and the error vectors, respectively. When 
Q = 1, the model reduces to standard compressed sensing 
(CS) model m. The key assumption of MMV model is that 
the signal matrix X is AT-rowsparse, i.e., at most K rows 
of X contain non-zero entries. The row-support of X is the 
index set of rows containing non-zero elements, supp(X) = 
{i S W} : Xij ^ Oforsomej}. When X is K- 

rowsparse, i.e., |supp(X)| < K, joint estimation can lead 
both to computational advantages and increased reconstruction 
accuracy; See 111, IS, 10, 0, 0, 0. 

The objective of multichannel sparse recovery problem is 
on finding a row sparse approximation of the signal matrix X 
based on knowledge of Y, the measurement matrix $ and 
the sparsity level K. Such a problems arises in electroen¬ 
cephalography and magnetoencephalography (EEG/MEG) 0 
blind source separation 0, and direction-of-arrival (DOA) es¬ 
timation of sources in array and radar processing fS), lH, ifTOll . 
Many greedy pursuit CS reconstruction algorithms have been 
extended for solving MMV problems. These methods, such as 


simultaneous normalized iterative hard thresholding (SNIHT) 
algorithm 0 are guaranteed to perform very well provided that 
suitable conditions (e.g., incoherence of $ and non impulsive 
noise conditions) are met. The derived (worst case) recovery 
bounds depend linearly on ||E|| 2 , so the methods are not 
guaranteed to provide accurate reconstruction/approximation 
under heavy-tailed non-Gaussian noise. 

In this paper, we generalize Huber’s criterion HU cf. 
Section 7.7, 7.8] (often referred to as ’’Huber’s approach 
2”) originally developed for overdetermined linear regression 
(M > N, Q — 1) model to the complex-valued case and for 
the more general multivariate sparse regression problem. This 
requires generalizing robust M-estimates of regression (and 
loss functions) for complex-valued case. In Huber’s devise, 
one estimates the signal matrix and scale of the error terms 
simultaneously. This is necessary since most robust loss- 
functions require an estimate of the scale. Using Huber’s 
criterion in the MMV model one may elegantly estimate 
both the sparse signal matrix and the scale of the errors 
simultaneously. In particularly, we are able to circumvent 
the problem of obtaining a preliminary robust scale estimate 
which is a challenging problem in ill-posed multivariate sparse 
regression model since the support of X and hence the 
contributing elementary vectors of on measurements are 
not known. In earlier related work Huber’s approach 2 has 
been considered for Lasso-type real-valued linear regression 
setting in ini, la and real-valued compressed sensing in 
m. Eor our multichannel sparse recovery problem, we devise 
SNIHT algorithm which results in a simple, computationally 
efficient and scalable approach for solving the MMV sparse 
reconstruction problem. 

Let us offer a brief outline of the paper. In Section m 
we give necessary notations and definitions as well as provide 
motivation and background of robust sparse recovery problem. 
Robust complex-valued loss functions and their properties 
are outlined in Section |III] and a generalization of Huber’s 
loss function for complex measurements is given. Then, in 
Section |IV] we formulate Huber’s criterion for MMV model 
and the related SNIHT algorithm, called HUB-SNIHT, is 
derived in Section [V] Einally, we illustrate the usefullness of 
the method in source localization application in Section [VT] 

H. Background 

A. Notations 

Eor a matrix A G cAfxiv index set F of cardinality 

|r| = K, we denote by Ar (resp. A(r)) the M x K (resp. 
K X N) matrix restricted to the columns (resp. rows) of A 


indexed by the set F. The ith column vector of A is denoted 
by Ri and the hermitian transpose of the ith row vector of A 
by a(i), A = (ai ■ • ■ aAr) = (a(i) • • • a(jv/))^- Furthermore, 
if / : C ^ C, then /(A) refers to element-wise application of 
the function to its matrix valued argument, so /(A) G ([^mxn 
with [f{A)]ij = f{a^j). 

The usual Euclidean norm on vectors will be written as || ■ ||. 
The matrix space is equipped with the usual Hermitian 

inner product 

M N 

(A,B)=Tr(BHA)=^^a.,6F 

i=i i=i 

where the trace of a (square) matrix is the sum of diagonal 
entries. We define the weighted inner product as 

M N 

(A, B)\v = ^ ^ ^ ^ 

i=l 3 = 1 

where W is M x A real matrix of positive weights. Note 
that (A, B)w reduces to conventional inner product when 
W is a matrix of ones. T he Frobe nius norm is given by the 
inner product as ||A|| = ^/(ATA)" and ||A||w = a/(A, A)w 
denotes the weighted Frobenius norm. The row-fg quasi-norm 
of A is the number of nonzero rows, i.e., || A||o = | supp(A)|. 
Hence the assumption that the signal matrix X G is K- 

rowsparse in the MMV model is equivalent with the statement 
that ||X||o < K. 

We use Hk{-) to denote the hard thresholding operator. 
for a matrix X G retains the elements of the 

K rows of X that possess largest ^ 2 -norms and set elements 
of the other rows to zero. Notation X|r refers to sparsified 
version of X such that the entries in the rows indexed by set 
r remain unchanged while all other rows have all entries set 
to 0. 


B. Robust constrained optimization problem 


At least two problems arises when using conventional 
robust loss functions in (|2]i. First, commonly used robust loss 
functions in robust statistics such as Huber’s or Tukey’s loss 
functions require an estimate of scale a. Obtaining a reliable 
robust estimate of scale is a difficult problem. It involves 
obtaining a A-rowsparse robust preliminary estimate Xg of 
the signal matrix and then computing robust scale estimate 
based on the resulting residual matrix Rg = Y — #Xg. 
Second problem is that robust loss functions are defined in the 
real-valued case and some thought must be given on special 
properties of complex-valued loss functions. These problems 
are addressed next in Section |III] and Section |IV] 


III. Foss FUNCTIONS: COMPLEX VALUED CASE 

We start by giving a proper definition of a loss function p. 

Definition 1: Function p : C —?> K.,^ is called a loss 
function if it verifies: 


(LI) p is circularly symmetric, p{e^^x) = p{x), \/9 G R. 

(L2) p(0) = 0. Furthermore, p is M-differentiable function 

and increasing in |e| > 0. 


Let us first note that condition (LI) is equivalent with the 
statement 

p{x)=Po{\x\) (3) 

for some pg : Rj ^ R||. The fact that Q => (LI) is obvious 
and the converse can be derived by invariance arguments. This 
illustrates that p is not C-differentiable (i.e., holomorphic or 
analytic function). This is of course natural since only func¬ 
tions that are both holomorphic and real-valued are constants. 
The complex derivative of p w.r.t. x* = {xr + jxj)* is 




A 1 / , 


dxi J 


which will be referred in the sequel as the score function. Since 
p(e) = po(|e|), we can write ip using basic rules of complex 
differentiation IMI in the form 


Suppose that the error terms are i.i.d. continuous 
random variables from a circular distribution Ea with p.d.f. 
/(e) = (l/a)fo(e/a), where /g(e) denotes the standard 
form of the density and ct > 0 is the scale parameter. If 
the scale is known, then a reasonable approach for solving 
the simultaneous sparse recovery problem is to minimize a 
distance criterion of residuals. 


D, 


v^) 


M Q /_ ' 


2=1 j = l 


a 


( 2 ) 


for some suitable loss function p(-) subject to A-rowsparsity 
constraint ||X||g < A. For conventional least squares (LS) loss 
function p(e) = |ep, the scale can be factored out from the 
objective function, and the minimization problem reduces to 

imn IIY - $Xf subject to ||X||g < A. 


where 


= ^Poi\x\)sign{x), 


sign(e) 


f e/|e|, for e ^ 0 
\o, for e = 0 


is the complex signum function and pg denotes the real 
derivative of the real-valued function pg. In order to make 
minimization of (|2]) possible by simple gradient descent type 
algorithms, we narrow down the set of loss functions by 
imposing the assumption: 


(L3) p : C —>■ Rj[" is a convex function 

For example, the conventional LS loss function p(x) = \x\'^ 
verifies assumptions (L1)-(L3). In this case, pg(r) = and 
the score function is 'fii.x) = x. In this paper, we assume that 
the loss function verifies (L1)-(L3). 


The well-known problem with LS minimization is that it gives 
a very small weight on small residuals and a strong weight on 
large residuals, implying that even a single large outlier can 
have a large influence on the obtained result. 


We define Huber’s loss function in the complex case as 


PhA^) 



for |e| < c 
for |e| > c, 


(4) 







where c is a user-defined threshold that influences the degree 
of robustness and efficiency of the method. Huber’s function is 
a hybrid of £2 and £i loss functions, using f 2 -loss for relatively 
small errors and fi-loss for relatively large errors. It verifies 
conditions (L1)-(L3). Huber’s score ('0-)function is 

Je, for |e| < c 

'iPH,c{e) = i . f . f 11 ^ 

l_csign(e), for \e\ > c 

Note that Huber’s ip is a winsorizing (clipping) funtion; the 
smaller the c, the more clipping is actioned on the residuals. 


Recall that notation ip (11) refers to element-wise application 
of ■i/j-function to its matrix valued argument, so [^/>(R)]ij = 
ip(rij). Thus if p is convex and the MMV model is overde¬ 
termined with non-sparse X, solving the above M-estimating 
equations would give the global minimum of (|5]l. 

The scaling factor a in Q is chosen so that the obtained 
scale estimate d is Fisher-consistent for the unknown scale a 
when eij ^ CA/'(0,ct^), which due to (|7]l is chosen so that 

a = E[x(e)], e^CAA(0,l). 


IV. Huber’s criterion for multichannel sparse 

RECOVERY 

As discussed earlier, the scale a of the error terms is 
unknown and needs to be estimated jointly with the signal 
matrix. We discuss here how this can be done elegantly 
using Huber’s approach 2. First note that Maximum likelihood 
(ML-)approach for solving the unknown X and a leads to 
minimizing the negative log-likelihood function of the form 


For many loss functions, a can be computed in closed-form. 
For example, for Huber’s function (|4|i the ^-function in ® 
becomes 


Xff.c(e) = |V'ff.c(e)P 


Hep, for|e|<c 
\c^, for |e| > c, 


and the concistency factor a = a{c) can be easily solved in 
closed-form by elementary calculus as 


QMLO^,cr) = (MQ) log cr -t- 

where p{e) = — log/o(e) depends on the underlying standard 
form of the density /o(e) of the error terms. Then, one could 
replace the ML loss function p with a robust loss function 
which need not be related to any circular density /o(-)^ s-g-^ the 
Huber’s loss function. The negative log-likelihood function is 
however not convex in (X, a). This follows since QmlO^, <t) 
is not convex in cr (for fixed X) and hence cannot be jointly 
convex. 


a = c2(l-F^.(2c2))+F^.(2c2). (9) 

Note that a depends on the threshold c. We will choose 

threshold c as = (l/2)F~2^{q) for q G (0,1). The 

X 2 

rationale behind this choice is that under Gaussian errors, 
2|ep/CT^ ^ xi- Hence a sensible choice is to determine c so 
that 2c^ is the qth upper quantile of the xi'distribution. The 
choice <7 —1, implies ^ 00 and hence no-trimming of 
the residuals. In our simulations we use q = 0.8 which yields 
c = 1.269. The smaller the c (and hence q) the more trimming 
is actioned on residuals. 


Huber im proposed an elegant devise to circumvent the 
above problem. See also 112 for further study of Huber’s 
approach. We generalize the Huber’s approach 2 for the 
complex multivariate regression case and minimize 


V. SNIHT ALGORITHM EOR HUBER’S CRITERION 
Our aim is at solving 


tlU \ " j 

where a > 0 is a fixed scaling factor. Important feature of 
the objective function is that it is jointly convex in (X, a) 
given that p is convex. In addition the minimizer X preserves 
the same theoretical robustness properties (such as bounded 
influence function) as the minimizer in the model where a 
is assumed to be known (fixed). This is not the case for the 
ML-objective function Qml(P^,o')- 

The stationary point of ® can be found by setting the 
complex matrix derivative of Q w.r.t. X* and the real derivative 
of Q w.r.t. cr to zero. Simple calculations then show that 
the minimizer (X,(t) is a solution to a pair of M-estimating 
equations: 


min(5(X,cr) subject to ||X||o < K. 

X,cr 

This problem is combinatorial (i.e., NP-hard) but greedy pur¬ 
suit approaches can be devised. Thus due to biconvexity of the 
objective function, we can use Huber’s loss function pH,c{e) 
and greedy pursuit NIHT algorithm can be devised to compute 
an approximate solution. Recall that NIHT is a projected 
gradient descent method that is known to offer efficient and 
scalable solution for X-sparse approximation problem ini- 
NIHT updates the estimate of X by taking steps towards the 
direction of the negative gradient followed by projection onto 
the constrained space. 

In Huber’s criterion, if we consider a fixed at a value a = 
fj-n+i (-{jjg value of cr at (n -f l)th iteration), the simultaneous 
NIHT (SNIHT) update of the signal matrix becomes 


^^iP 


1 


MQ 


M Q 

EE 


X 


i=l j=l \ 

where R = Y — $X and x : 

X{t) = Po(.t). 




R\ 

(6) 


^ ) 


where /r" 


(7) 

and 

}" is defined as 


will be 

t). 

(8) 

-Vx*p( 


X"+i = Rif(X" -f 


m = ip 


R’" 

T-n+l 


n+l 


ktt 


(^n-i-i )2 _ $^R”. The scale is updated 








(consider signal matrix X fixed at a value X = X") using 
(|7]i by a fixed-point iteration 


= 


(a-r 


1 


M Q 
i —1 j —1 


where R" = Y - $X" 


The pseudo-code for the SNIHT algorithm in the case 
that the loss function p is Huber’s function @ is given in 
Algorithm [T] We refer to this algorithm as HUB-SNIHT in 
the sequel. The steps 3-9 can be divided to 3 stages described 
below: scale stage (Steps 3, 4) build up the scale update cr"+^, 
signal stage (Steps 5, 7, 8, 9) build up the AT-sparse signal 
update X"+^ and the support r"+^, and step size stage (Step 7) 
computes the optimal stepsize update for the gradient descent 
move. The computation of the stepsize will be described in 
the next two paragraphs. Note that it is possible to tune the 
algorithm for different applications by simply altering the 
criterion for halting the algorithm. Matlab function is available 
at http://users.spa.aalto.fi/esollila/software.html 


Algorithm 1: HUB-SNIHT algorithm 


input : Y, €>, sparsity K, trimming threshold c. 
output : (X”+^, 0 -”+^, r"+^) estimates of X, cr and 
r = supp(X). 

initialize: X° = 0, /j,° = 0, n = 0, r° = 0, a = a(c). 

1 (7° = 1.201 • median(|j/y I,f = 1,..., M,j = 1,... ,Q) 

2 ro = supp(RK(^^V'ff.c(Y/aO))) 
while halting criterion false do 


6 

7 

8 
9 

10 


R" = Y - $X" 




R; = i’H.c 

G" = $“R 


1 

a MQ 

R' 


T-n+l 


M Q 

EE 

n+l 




■'ll) 


= CompStepsize(R", G, T”,/r”, 


jU+l ^ 

X"+i = + p 

r’^+i = supp(X"+i) 
n = n + l 


n+lf~in 


G") 


end 


As was noted in ifTTll . stepsize selection is very important 
for convergence and needs to be adaptively controlled at each 
iteration. Given the found support T" is correct, we choose 
IjTi+t fjjg nrinimizer of the convex objective function (|2|i 
for fixed scale at in the gradient ascent direction X" + 
/rG"|rn, i.e. 


^ip) ~ ^PH,. 
= 


Y - $(X" -b MG"|r") 


yU+l 


R" - pB^ 

rr-n+l 


( 10 ) 


where R" = Y — $X" and B" = This reduces to 

minimizing a simple linear regression (M-)estimation problem 
where the response is r = vec(R") and the predictor is b = 
vec(B"). It is easy to show (details omitted) that the minimizer 


p of L{p) is the unique solution to a fixed point (FP) equation 
p = H{p), where 

H{p) = ||B-||w2(^)Re((R”,B-)w(^)) dD 

where the right hand side depends on the unknown p via the 
weight matrix W(/r), defined as 


W(^) = WH,c 


R" - pB^ 


where wh,c is a weight function based on Huber’s loss 
function, defined as 


Wh,cA 


tpH,cA 

e 


( 1, for |e| < c 
\c/|e|, for |e| > c ’ 


If the loss function is LS-loss p(e) 


(equivalent to 


Huber’s function when c —oo), then the minimizer of (fTOl i is 
easily found in closed form since in this case W {p) is equal 
to a matrix of ones. Hence the FP equation is explicit and the 
solution is = ||G"p„j|p/||$r"G"p„j|p. This is indeed 

the same stepsize used in conventional SNIHT ||6l. 


For Huber’s loss function, the minimizer of (doll can be 
found by running the FP iterations until convergence (with 
initial value po > 0). Instead, we use approximate of the 
solution given by 1-step FP iterate with initial value given 
by the previous stepsize p". In other words, in Step 7, the 
update is computed as = i7(/r"). 


VI. Application to Source Localization 

We consider sensor array consisting of M sensors that 
receives K narrowband incoherent farfield plane-wave sources 
from a point source (M > K). At discrete time t, the array 
output (snapshot) y(/) € is a weighted linear combination 
of the signal waveforms x(t:) = ..., XK{t)A corrupted 

by additive noise e{t) £ y{t) = A{6)x{t) -b e{t), where 
A = A{9) is the M x K steering matrix parametrized by 
the vector 6 = {6i,... ,9 kA of (distinct) unknown direction- 
of-airivals (DOA’s) of the sources. Each column vector a(0i), 
called the steering vector, represents a point in known array 
manifold a(0). The objective of sensor array source localiza¬ 
tion is to find the DOA’s of the sources, i.e., to identify the 
steering matrix A(0) parametrized by 0. We assume that the 
number of sources K is known. 

As in lO, we cast the source localization problem as a mul¬ 
tichannel sparse recovery problem. We construct an overcom¬ 
plete M X N steering matrix A{9), where 9 = {Oi, ■ ■ ■, 0 nA 
represents a sampling grid of all source locations of interest. If 
9 contains the true DOA’s 6i, i = 1,..., K, then the measure¬ 
ment matrix Y = (y(/i) ••• yitg)) £ consisting 

of snapshots at time instants ti,... ,tQ can be exactly modelled 
as MMV model O, where the signal matrix X £ 
is AT-rowsparse matrix with source signal sequences as its 
non-zero row vectors. Thus identifying the source locations 
is equivalent to identifying the support F = supp(X) since 
any z S F maps to a DOA 9i in the grid. Since the steering 
matrix A{9) is completely known, we can use HUB-SNIHT 
method to identify the support. 

We assume that K = 2 independent (spatially and tem¬ 
porally) complex circular Gaussian source signals of equal 
















power cr^ arrive on an uniform linear array (ULA) of M = 20 
sensors with half a wavelength inter-element spacing from 
DOA’s 01 = 0° and 02 = 8°. In this case, the array manifold 
is a(0) = The noise 

matrix E € has i.i.d. elements following inverse 

Gaussian compound Gaussian (IG-CG) distribution IfTSl with 
shape parameter A = 0.1 and unit variance. CG-IG distribution 
is heavy-tailed and has been shown to accurately model radar 
clutter in IfTSl . Note that the covariance matrix of the snapshot 
is Cov(y(fi)) = alA{9)A{9)^ + Im, so we may use the 
popular MUSIC method to localize the sources. In other words, 
we search for K = 2 peaks of the MUSIC pseudospectrum 
in the grid. We use a uniform grid 9 on [—90, 90] with 2° 
degree spacing, thus containing the true DOA’s. For the source 
localization application, we make the following modifcation to 
the algorithm: In Step 1 of HUB-SNIHT algorithm, we locate 
the K largest peaks of rownorms of c(Y) instead of 

taking r° as indices of K largest rownorms of c(Y). 

We then use SNIHT, HUB-SNIHT and MUSIC to identify 
the support (which gives the DOA estimates) and compute 
the empirical probability of exact recovery (PER) rates and the 
relative frequency of DOA estimates in the grid based on 1000 
MC runs. Full PER rate = 1 implies that the support T (and 
hence DOA’s) were correctly identified in all MC trials. Such 
a case is shown in upper plot of Figure [T] for HUB-SNIHT 
when the number of snapshots is Q = 50 and the SNR is —10 
dB. The PER rate of HUB-SNIHT was 0.99, but PER rates of 
SNIHT and MUSIC were considerably lower, 0.81 and 0.94, 
respectively. In the second setting, we lower the SNR to —20 
dB. In this case, the conventional SNIHT and MUSIC methods 
fail completely and provide nearly a uniform frequency on the 
grid. This is illustrated in the middle plot of Figure 1. Note 
that the robust HUB-SNIHT provides high peaks on the correct 
DOA’s. The PER rates of SNIHT, HUB-SNIHT and MUSIC 
were 0.02, 0.48 and 0.01, respectively. Thus only HUB-SNIHT 
is able to offer good localization of the sources whereas the 
non-robust methods do not provide much better performance 
than a random guess. In the 3rd setting, we alter the set-up of 
1st setting by decreasing the number of snapshots from Q = 50 
as low as Q = 5. The performance differences between the 
methods are now more significant as is illustrated in the lower 
plot of Figure [T] In this case the PER rates of SNIHT, HUB- 
SNIHT and MUSIC were 0.19, 0.57 and 0.37, respectively. 
Again, the HUB-SNIHT performed the best. 
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